Zurich by the Numbers - Predictive Insights into Tourism Dynamic
Authors
Affiliation
Name I, First Name I
University of Lausanne
Name II, First Name II
Published
May 8, 2024
Abstract
The following Forecasting project focuses on applying forecasting techniques to predict tourism trends in Zurich. This analysis aims to harness the power of historical data combined with forecasting algorithms to provide actionable insights into future tourism patterns. We engage in comprehensive data preparation, explore various predictive models, and conduct a detailed evaluation of their forecasting accuracy. The project encapsulates the challenge of turning complex data into understandable and strategic information, crucial for effective decision-making in Zurich’s tourism sector.
1 Exploration & Visualization
1.1 Objectives
The main objectives of this project is to predict :
The overnight stays of the visitors in Vaud, from October 2023 until December 2024.
The overnight satys of visitors from Philippines to Zürich, from October 2023 until December 2024.
2 DATA
2.1 Cleaning & Wrangling
2.1.1 Tourism Data - General Overview
The dataset contains information about the overnights stays by tourists in the various Swiss cantons. It indicates the tourist’s country of origin, the canton of stay, the month, the year and the total number of overnights stays.
As the dataset provided is in German, we have translated the data in English to make it more intuitive and understandable for everyone. Then, we created a new ‘Date’ column, year-month-day, which corresponds to the correct format to be able to make predictions.
Click to show code
# Load the data in folder data named Dataset_tourism.xlsx)tourism_data <- readxl::read_xlsx(here("data/Dataset_tourism.xlsx"))#removing value 'Herkunftsland - Total' in column 'Herkunftsland' as it is just the total#tourism_data <- tourism_data %>% filter(Herkunftsland != "Herkunftsland - Total")#print unique values in month columnunique(tourism_data$Monat)#> [1] "Januar" "Februar" "März" "April" "Mai" #> [6] "Juni" "Juli" "August" "September" "Oktober" #> [11] "November" "Dezember"# change ' [1] "Januar" "Februar" "März" "April" "Mai" "Juni" "Juli" "August" "September" "Oktober" "November" "Dezember" into english month'tourism_data$Monat <- tourism_data$Monat %>%recode_factor("Januar"="January","Februar"="February","März"="March","April"="April","Mai"="May","Juni"="June","Juli"="July","August"="August","September"="September","Oktober"="October","November"="November","Dezember"="December")#add date type column for plotting purposestourism_data <- tourism_data %>%mutate(Date =dmy(paste("01", Monat, Jahr)))# filtering out 'Herkunftsland - Total' in column 'Herkunftsland' as it is just the totaltourism_data_no_total <- tourism_data %>%filter(Herkunftsland !="Herkunftsland - Total")#check for NANsum(is.na(tourism_data_no_total))#> [1] 51395#analyse the NAN values, where are they(tourism_data_no_total %>%filter(is.na(value)))#> # A tibble: 51,395 x 6#> Herkunftsland Kanton Monat Jahr value Date #> <chr> <chr> <fct> <chr> <dbl> <date> #> 1 Malta Schwe~ Janu~ 2005 NA 2005-01-01#> 2 Zypern Schwe~ Janu~ 2005 NA 2005-01-01#> 3 Mexiko Schwe~ Janu~ 2005 NA 2005-01-01#> 4 Übriges Zentralamerika, Karib~ Schwe~ Janu~ 2005 NA 2005-01-01#> 5 Bahrain Schwe~ Janu~ 2005 NA 2005-01-01#> 6 Katar Schwe~ Janu~ 2005 NA 2005-01-01#> 7 Kuwait Schwe~ Janu~ 2005 NA 2005-01-01#> 8 Australien Schwe~ Janu~ 2005 NA 2005-01-01#> 9 Neuseeland, Ozeanien Schwe~ Janu~ 2005 NA 2005-01-01#> 10 Oman Schwe~ Janu~ 2005 NA 2005-01-01#> # i 51,385 more rows#show data using reactable only showing the first 100 rowsreactable::reactable(head(tourism_data_no_total, 1000), searchable =TRUE)
2.1.2 Tourism Data - Vaud
Click to show code
# Filter by canton Vaud tourism_vaud <- tourism_data %>%filter(Kanton =="Vaud")#check for NANsum(is.na(tourism_vaud))#> [1] 1869#show the data in a table using reactablereactable::reactable(head(tourism_vaud, 20))
2.1.3 Tourism Data - Zurich
We filtered the ‘Canton’ column to keep only the canton of Zurich.
Click to show code
#filter column 'Kanton' for Zurichtourism_data_zurich <- tourism_data_no_total %>%filter(Kanton =="Zürich")#check for NANsum(is.na(tourism_data_zurich))#> [1] 1869#analyse the NAN values, where are theytourism_data_zurich %>%filter(is.na(value))#> # A tibble: 1,869 x 6#> Herkunftsland Kanton Monat Jahr value Date #> <chr> <chr> <fct> <chr> <dbl> <date> #> 1 Malta Zürich Janu~ 2005 NA 2005-01-01#> 2 Zypern Zürich Janu~ 2005 NA 2005-01-01#> 3 Mexiko Zürich Janu~ 2005 NA 2005-01-01#> 4 Übriges Zentralamerika, Karib~ Zürich Janu~ 2005 NA 2005-01-01#> 5 Bahrain Zürich Janu~ 2005 NA 2005-01-01#> 6 Katar Zürich Janu~ 2005 NA 2005-01-01#> 7 Kuwait Zürich Janu~ 2005 NA 2005-01-01#> 8 Australien Zürich Janu~ 2005 NA 2005-01-01#> 9 Neuseeland, Ozeanien Zürich Janu~ 2005 NA 2005-01-01#> 10 Oman Zürich Janu~ 2005 NA 2005-01-01#> # i 1,859 more rows#show the data in a table using reactablereactable::reactable(head(tourism_data_zurich, 1000))
2.1.4 Tourism Data - Zurich and Philipines
Same as above, but we’re also filtering the country of origin, keeping only ‘Philippinen’ as that is what we’re aiming for.
Click to show code
tourism_data_zurich_philippines <- tourism_data_zurich %>%filter(Herkunftsland =="Philippinen")#show table using reactablereactable::reactable(tourism_data_zurich_philippines)
2.1.5 Deal with NAN
A supprimer ou juste mentionner quon a pas de NAN We have none in the data filtered with Zurich and Philippines, but if we would have we would : Impute missing values ARIMA
If the missing values are random or if excluding them would result in a loss of valuable information, we might consider imputing them. One common approach is to use statistical models like ARIMA to interpolate missing values based on the patterns observed in the available data.
Click to show code
# #Creating a tsibble with missing values# data <- tourism_data_zurich_philippines %>%# as_tsibble(key = c(Kanton, Herkunftsland, Monat, Jahr)) %>%# select(Date, value) %>%# fill_gaps()# # # Fit an ARIMA model to data with missing values# model_fit <- data %>%# model(ARIMA(value))# # # Interpolate missing values using the fitted ARIMA model# filled_data <- model_fit %>%# interpolate(data)# # # Print the data with filled in missing values# print(filled_data)
3 EDA - Vaud
3.1 Visitors from different countries in Vaud
Click to show code
plot_vaud <- tourism_vaud %>%filter(Herkunftsland !='Herkunftsland - Total') %>%ggplot(aes(x = Date, y = value, group = Herkunftsland, color = Herkunftsland,text =paste("Country:", Herkunftsland, "Trips:", value))) +# Added text for tooltipgeom_line(show.legend =FALSE) +labs(title ="Number of visitors from Each Country to Vaud",x ="Date",y ="Number of Trips") +theme_minimal()# Convert to an interactive plotly objectinteractive_plot <-ggplotly(plot_vaud, tooltip ="text")# Adjust plotly settings interactive_plot <- interactive_plot %>%layout(margin =list(l =60, r =60, b =60, t =80), # Adjust marginslegend =list(orientation ="h", x =0, xanchor ="left", y =-0.2)) # Adjust legend position# Display the interactive plotinteractive_plot
According to the graph the most tourist in canton Vaud are Swiss and France.
Graphical view of total number of tourists in canton Vaud
Click to show code
head(tourism_vaud)#> # A tibble: 6 x 6#> Herkunftsland Kanton Monat Jahr value Date #> <chr> <chr> <fct> <chr> <dbl> <date> #> 1 Herkunftsland - Total Vaud January 2005 57942 2005-01-01#> 2 Schweiz Vaud January 2005 27853 2005-01-01#> 3 Baltische Staaten Vaud January 2005 103 2005-01-01#> 4 Deutschland Vaud January 2005 3052 2005-01-01#> 5 Frankreich Vaud January 2005 7667 2005-01-01#> 6 Italien Vaud January 2005 2154 2005-01-01tourism_vaud_total <- tourism_vaud %>%filter(Herkunftsland =='Herkunftsland - Total') %>%select(-c(Herkunftsland, Kanton, Monat, Jahr))tourism_vaud_total%>%ggplot(aes(x = Date, y = value)) +geom_line() +labs(x ="Date", y ="Number of tourists", title ="Total number of tourists in canton Vaud") +theme_minimal()
3.1.1 Decomposition
Click to show code
# Convert data to a time series objectvaud_ts <- tourism_vaud_total %>%arrange(Date) %>%# Filtre pour enlever les valeurs NA dans 'Date'filter(!is.na(Date)) %>%# Ensure data is complete and monthlycomplete(Date =seq.Date(min(Date, na.rm =TRUE), max(Date, na.rm =TRUE), by ="month")) %>%replace_na(list(value =0)) %>%# Replace NA values if there are any# Create a time series objectwith(ts(value, frequency =12, start =decimal_date(min(Date, na.rm =TRUE))))# Decompose the time seriesvaud_ts %>%decompose() %>%plot()
We see clear seasonality with picks in summer and dip in January.
We saw a big drop in March 2020 due to the global coronavirus pandemic.
4 EDA - Zurich
4.1 Zurich and All visiting countries
Click to show code
# Preparing the data#removing value 'Schweiz' in column 'Herkunftsland' as it is just the whole of Switzerlanddata <- tourism_data_zurich %>%filter(!is.na(value)) %>%# Removing rows with NA values in the 'value' columnmutate(Monat =month(Date, label =TRUE, abbr =TRUE), # Extract month from DateJahr =year(Date)) %>%# Extract year from Dategroup_by(Herkunftsland, Date) %>%# Group by country and datesummarise(Trips =sum(value), .groups ='drop') # Summing up trips for each country per datep <-ggplot(data, aes(x = Date, y = Trips, group = Herkunftsland,color = Herkunftsland =="Philippinen",text =paste("Country:", Herkunftsland, "<br>Trips:", Trips))) +# Added text for tooltipgeom_line(show.legend =FALSE) +scale_color_manual(values =c("TRUE"="red", "FALSE"="grey")) +labs(title ="Number of Trips from Each Country to Zurich",x ="Date",y ="Number of Trips") +theme_minimal()# Convert to an interactive plotly objectinteractive_plot <-ggplotly(p, tooltip ="text")# Adjust plotly settings interactive_plot <- interactive_plot %>%layout(margin =list(l =60, r =60, b =60, t =80), # Adjust marginslegend =list(orientation ="h", x =0, xanchor ="left", y =-0.2),width =600,height =400) # Adjust legend position#> Warning: Specifying width/height in layout() is now deprecated.#> Please specify in ggplotly() or plot_ly()# Display the interactive plotinteractive_plot
This graph shows changes in the number of overnight stays from January 2005 to September 2023.
The red curve represents the Philippines. It is flat and shows a considerably small number of trips from this country to Zurich over the period. The grey curves represent other countries visiting the canton of Zurich. Swiss people are the most frequent visitors, followed by Germany and the United States.
A drastic drop can be observed. This could be due to the COVID-19 pandemic, which had a significant impact on international travel and thus on tourism industry worldwide.
At first glance, we can observe regular seasonal peaks for most countries, suggesting the presence of seasonality in tourism to the canton of Zurich.
4.2 Zurich and Philippinens Visitors
This graph shows only visitors from the Philippines, as this is the country of interest in our analysis.
Click to show code
# use tourism_data_zurich_philippines data to plot the values in y axis and Date in x axisp <-ggplot(tourism_data_zurich_philippines, aes(x = Date, y = value)) +geom_line() +labs(title ="Number of Trips from Philipines to Zurich",x ="Date",y ="Number of Trips") +theme_minimal()p
Slight seasonality
General upward trend in the number of nights spent before 2020.
Very sharp fall in 2020 - covid
General upward trend after the fall
4.2.1 Pattern
4.2.1.1 Decomposition
We apply a standard decomposition of the number of overnight stays in Zurich for the Philippines to reveal seasonal trends, long-term trends and irregular components of the data.
Splitting the times series into several components allows us to understand how the different components contribute to the variations observed in Swiss tourism data.
We chose to split the times series into monthly data.
Click to show code
# Convert data to a time series objecttourism_ts <- tourism_data_zurich_philippines %>%arrange(Date) %>%# Ensure data is complete and monthlycomplete(Date =seq.Date(min(Date), max(Date), by ="month")) %>%replace_na(list(value =0)) %>%# Replace NA values if there are any# Create a time series objectwith(ts(value, frequency =12, start =decimal_date(min(Date))))# Decompose the time seriesdecomposed <-decompose(tourism_ts)# Plot the decomposed componentsplot(decomposed)
4.2.1.2 Seasonality
Click to show code
# Plot the seasonality in one chart ggseasonplot(tourism_ts, year.labels =TRUE, year.labels.left =TRUE)
Click to show code
# several chart per month to see the seasonalityggsubseriesplot(tourism_ts) +ylab("Number of tourists") +xlab("Month") +ggtitle("Seasonal subseries plot")#debug#better to use gg_subseries to see the seasonality#tourism_ts %>% gg_subseries(value) + ylab("Number of tourists") + xlab("Month") + ggtitle("Seasonal subseries plot")
4.2.1.3 Trend
Juste ecrire text sur le upward trend qu’on a vu dans la decomposition
5 MODELLING
This part is about building on your knowledge of time series techniques to model your data. You can investigate various models but you should justify in your report your choices regarding these. Pay attention to the conditions that are needed to apply a specific model. Treat also carefully seasonality, outliers, colinearity, covariates, special events, etc. Remember the following steps:
Aggregation choice for hierarchical time series
Model building
Model selection
5.1 Total number of visitors in Vaud
5.1.1 Outliers, Correlation, Colinearity, Covariates, Special Events ?
Questions ?
5.1.2 ETS model
Click to show code
ets_vaud <-ets(vaud_ts, model ="AAA")forecast_ets_vaud <-forecast(ets_vaud, h =24) %>%plot(main ="Forecast of visitors in Vaud", xlab ="Date", ylab ="Number of visitors")
5.1.3 ARIMA
Click to show code
arima_vaud <-auto.arima(vaud_ts, seasonal =TRUE)# Generate forecasts for the next 2 years (24 months)forecast_arima_vaud <-forecast(arima_vaud, h =24)# Plot the forecastplot(forecast_arima_vaud, main ="ARIMA Forecast for Vaud Tourism", xlab ="Date", ylab ="Number of Tourists")
Question: Forecast for each country or general forecast?
5.2 Zurich and Philipines
5.2.1 Forecast without dealing with Covid
5.2.1.1 Naive Forecast
Click to show code
#convert tourism_ts to tsibbletourism_ts <- tourism_ts %>%as_tsibble()# Fit a naive modelfit <- tourism_ts %>%model(NAIVE =NAIVE(value))# Forecast the next 2 years periodsforecast <- fit %>%forecast(h =24)# Plot the forecasts along with the historical data, and make the colors of the forecast a bit more transparent for distinguishably purposesplot <- forecast %>%autoplot(tourism_ts, alpha =0.5) +labs(title ="Forecast of tourists from Philipines to Zurich",x ="Date",y ="Number of tourists") +guides(colour =guide_legend(title ="Forecast"))plot
5.2.1.2 Exponential Smoothing
Why additive errors ? Because the variance of the errors is constant over time no ?
Click to show code
# Fit an ETS model# Adjusting the model parameters according to the characteristics of the data# Here "A" means additive error, "N" means no trend, and "N" means no seasonality# change these if neededfit <- tourism_ts %>%model(ETS_M_seaso =ETS(value ~error("A") +trend("A") +season("A"))) #multiplicative seasonality)# Forecast the next 2 years periodsforecast <- fit %>%forecast(h =24)# Plot the forecasts along with the historical data, and make the colors of the forecast a bit more transparent for distinguishably purposesplot <- forecast %>%autoplot(tourism_ts, alpha =0.5) +labs(title ="Forecast of tourists from Philipines to Zurich",x ="Date",y ="Number of tourists") +guides(colour =guide_legend(title ="Forecast"))plot
Clearly see here that the confidence interval is too big, almost like a naive forecast
Why trend dp and seaso M ? - Trend is present so A - Seasonality is present and growing over time so Multiplicative was chosen
Click to show code
# comparing several modelfit <- tourism_ts %>%model(ETS_M_seaso =ETS(value ~error("A") +trend("A") +season("M")), #multiplicative seasonalityETS_M_seaso_Ad =ETS(value ~error("A") +trend("Ad") +season("M")), #dampted trend )# Forecast the next 2 years periodsforecast <- fit %>%forecast(h =24)# Plot the forecasts along with the historical data, and make the colors of the forecast a bit more transparent for distinguishably purposesplot <- forecast %>%autoplot(tourism_ts, level =90, color ="blue", alpha =0.5) +labs(title ="Forecast of tourists from Philipines to Zurich",x ="Date",y ="Number of tourists") +guides(colour =guide_legend(title ="Forecast"))plot
5.2.1.2.1 Thus Chosen Model :
Click to show code
fit <- tourism_ts %>%model(ETS_M_seaso =ETS(value ~error("A") +trend("Ad") +season("M"))) #multiplicative seasonality)# Forecast the next 2 years periodsforecast <- fit %>%forecast(h =24)# Plot the forecasts along with the historical data, and make the colors of the forecast a bit more transparent for distinguishably purposesplot <- forecast %>%autoplot(tourism_ts, alpha =0.5) +labs(title ="Forecast of tourists from Philipines to Zurich",x ="Date",y ="Number of tourists") +guides(colour =guide_legend(title ="Forecast"))plot
5.2.1.3 ARIMA
Question : Do we need to differentiate the data ?
Click to show code
# Fit an automatic ARIMA modelfit_arima <- tourism_ts %>%model(ARIMA_auto =ARIMA(value))# Forecast the next 2 years (24 months)forecast_arima <- fit_arima %>%forecast(h =24)# Plot the forecasts along with the historical dataplot_arima <- forecast_arima %>%autoplot(tourism_ts, alpha =0.5) +labs(title ="ARIMA Forecast of Tourists from the Philippines to Zurich",x ="Date",y ="Number of Tourists") +guides(colour =guide_legend(title ="Forecast"))plot_arima
Ugly forecast, confidence interval is too big
5.2.1.4 Diagnostic
Click to show code
# Check the diagnostics of the ARIMA modelreport <- fit_arima %>%report()#> Series: value #> Model: ARIMA(0,1,4)(0,0,2)[12] #> #> Coefficients:#> ma1 ma2 ma3 ma4 sma1 sma2#> -0.3889 0.0978 -0.1636 -0.155 0.418 0.1582#> s.e. 0.0693 0.0809 0.0638 0.095 0.086 0.0688#> #> sigma^2 estimated as 13178: log likelihood=-1379#> AIC=2771 AICc=2772 BIC=2795autoplot(residuals(fit_arima))#show report in html
Click to show code
# using auto.arima with stepwise and approximation options turned off for a more thorough searchfit_updated <-auto.arima(tourism_ts, seasonal =TRUE, stepwise =FALSE, approximation =FALSE)summary(fit_updated)#> Series: tourism_ts #> ARIMA(4,1,0)(1,0,0)[12] #> #> Coefficients:#> ar1 ar2 ar3 ar4 sar1#> -0.407 -0.161 -0.236 -0.270 0.462#> s.e. 0.064 0.071 0.073 0.066 0.074#> #> sigma^2 = 12436: log likelihood = -1373#> AIC=2758 AICc=2758 BIC=2778#> #> Training set error measures:#> ME RMSE MAE MPE MAPE MASE ACF1#> Training set 5.55 110 66.9 -77.7 147 0.589 -0.041